ggml-cpu: arm64: q4_K repack gemm and gemv implementations (i8mm) #16739
base: master
Conversation
@ggerganov is there something else needed from my side, or are we waiting for another review?
There seems to be a bug somewhere. Here is a repro on M4 Max:

../scripts/get-wikitext-2.sh
make -j && ./bin/llama-perplexity -hf LiquidAI/LFM2-2.6B-GGUF:Q4_K_M -f ./wikitext-2-raw/wiki.test.raw -dev none
...
# PPL sky-rockets:
0.01.007.961 I perplexity: calculating perplexity over 581 chunks, n_ctx=512, batch_size=2048, n_seq=4
0.05.476.977 I perplexity: 4.47 seconds per pass - ETA 10.82 minutes
[1]6.8941,[2]1485.3563,[3]8468.4132,[4]21269.3291,[5]4800.3655,[6]9365.2385,[7]15453.2190,[8]22744.0153,^C
I was able to replicate the PPL skyrocketing with the generic implementation as well; I'll try to figure out what is going on.

Edit: It also happens with the Q4_0 repack. Interestingly, it happens from the second chunk onwards. I'll try to run on an AVX machine to see whether it's something totally unrelated to the GEMMs themselves. I also compared the tensor outputs of all mul mats for a couple of llama-eval-callback runs and the results were practically identical, except for a 0.0001 deviation here and there. What I don't understand is how I was able to run the PPL with LFM correctly before; I may have messed up GGML_CPU_REPACK in the build, sorry about that.
Hm yes -
I've opened #17030 for the fix.
As you pointed out, LFM2 had some MAT_MUL layers with a (6144, 256, 2, 1) tensor, where only the first 6144*256 elements were multiplied.
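To make that failure mode concrete, here is a minimal, hedged illustration (plain C, not ggml's actual code or the fix in #17030): with ne = {6144, 256, 2, 1} the tensor holds two 6144x256 planes, and a kernel that only walks ne0*ne1 elements processes the first plane and silently skips the second.

```c
#include <stdint.h>

// Illustrative only: 'ne' mirrors ggml's dims-array convention.
static int64_t elems_visited_by_buggy_kernel(const int64_t ne[4]) {
    return ne[0] * ne[1];                 // first 6144*256 elements only
}

static int64_t elems_actually_required(const int64_t ne[4]) {
    return ne[0] * ne[1] * ne[2] * ne[3]; // all ne2*ne3 batch planes
}
```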
Signed-off-by: Alberto Cabrera <[email protected]>
Force-pushed from 7081eda to 8a2fd93
@ggerganov #17241 fixed the perplexity issues, so this PR is ready for review again (it's rebased on top of master).
@ggerganov sorry for pinging again! I don't have merge rights; could you please merge?
It's pending review by @slaren
Ah, sorry for the misunderstanding! I got another PR merged with a single review and didn't realize both approvals were needed. Thanks!
UNUSED(s);
UNUSED(bs);
UNUSED(vx);
UNUSED(vy);
UNUSED(nr);
UNUSED(nc);
I don't think these are necessary.
Fixed, thanks for catching that. Want me to remove those from the other implementations that also call the generic? They have the same problem.
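For context, a hedged sketch of the pattern under discussion (the wrapper and generic function names here are assumed, not the PR's exact code): when a function unconditionally forwards all of its parameters to the generic implementation, every parameter is used, so the UNUSED() markers are redundant. They are only needed on build paths where a parameter is genuinely never read.

```c
#include <stddef.h>

// assumed declaration of the generic fallback
void ggml_gemv_q4_K_8x8_q8_K_generic(int n, float * s, size_t bs,
                                     const void * vx, const void * vy,
                                     int nr, int nc);

void ggml_gemv_q4_K_8x8_q8_K(int n, float * s, size_t bs,
                             const void * vx, const void * vy,
                             int nr, int nc) {
    // every parameter is forwarded, so no UNUSED() is needed here
    ggml_gemv_q4_K_8x8_q8_K_generic(n, s, bs, vx, vy, nr, nc);
}
```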
UNUSED(ncols_interleaved);
UNUSED(blocklen);
#if !((defined(_MSC_VER)) && !defined(__clang__)) && defined(__aarch64__) && defined(__ARM_NEON)
What's the reason for excluding MSVC here? There is no inline assembly.
No reason; I copied the q4_0 gemv guards and worked from there. As you've pointed out, looking at the implementations in quants.c, it seems safe to remove !((defined(_MSC_VER)) && !defined(__clang__)). Do you want me to also remove the guards from ggml_gemv_q4_0_4x8_q8_0, ggml_gemv_iq4_nl_4x4_q8_0, ggml_gemv_q4_0_4x4_q8_0 and ggml_gemm_iq4_nl_4x4_q8_0? Those don't have inline assembly either.
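For reference, a sketch of how the guard would look with the MSVC exclusion dropped, per the review note that there is no inline assembly in these paths:

```c
#if defined(__aarch64__) && defined(__ARM_NEON)
// NEON / i8mm implementation
#else
// fall through to the generic implementation
#endif
```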
Co-authored-by: Diego Devesa <[email protected]>
This PR improves the q4_k_q8_k gemm and gemv on arm64 using i8mm and vecdot instructions.
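As a rough, hedged sketch of the i8mm building block (illustrative layout and names, not the PR's actual kernel; assumes a compiler target with +i8mm): SMMLA, exposed as vmmlaq_s32, accumulates a 2x2 int32 tile from two 2x8 int8 operands and is the GEMM workhorse; the GEMV path would use vdotq_s32 similarly, with a single output row.

```c
#include <arm_neon.h>
#include <stdint.h>

// C(2x2) += A(2xK) * B(2xK)^T for int8 inputs, K a multiple of 8.
static int32x4_t tile_2x2_i8mm(const int8_t * a0, const int8_t * a1,
                               const int8_t * b0, const int8_t * b1,
                               int k) {
    int32x4_t acc = vdupq_n_s32(0);
    for (int i = 0; i < k; i += 8) {
        // pack two 8-byte rows of A and of B into 16-byte vectors
        const int8x16_t va = vcombine_s8(vld1_s8(a0 + i), vld1_s8(a1 + i));
        const int8x16_t vb = vcombine_s8(vld1_s8(b0 + i), vld1_s8(b1 + i));
        // acc += { A0.B0, A0.B1, A1.B0, A1.B1 }
        acc = vmmlaq_s32(acc, va, vb);
    }
    return acc;
}
```

In the real kernels the int8 operands would come from the repacked q4_K nibbles and the q8_K activations, with the block scales applied to the int32 tile before accumulating into float.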
Tested on an Apple M4 Max, REPACK vs NO REPACK:
- REPACK: 8a2fd93 (7070)
- NO REPACK: 45c6ef7 (7058)
Perplexity
As for test-backend-ops, I've checked the output of the layer tensors manually, comparing REPACK vs master, since #16182 is still ongoing.